serial extract_archive to prevent unnecessary extractions #786
base: main
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff            @@
##              main      #786   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           74        74
  Lines         3419      3439    +20
  Branches       613       621     +8
=========================================
+ Hits          3419      3439    +20
071e0bc
to
c65939f
Compare
Using this script with a pre-downloaded gazebase-vr archive:

    from pathlib import Path
    import pymovements as pm
    pm.utils.archives.extract_archive(Path('gazebasevr.zip'))

I ran `time python t.py`.

For this branch:

    Extracting gazebasevr.zip to .
    real 6m53,025s
    user 0m31,852s
    sys 0m8,898s

For current main:

    Extracting gazebasevr.zip to .
    real 7m12,829s
    user 0m33,634s
    sys 0m9,835s

Different workloads might affect the performance, but it should not be much slower. (In the example above, the loop variant was even faster?!)
161b3ec
to
265687f
Compare
Great, thanks a lot! That is what I had in mind when creating #488.
We should introduce a new argument that lets the user decide whether to continue a previous extraction or to extract all members again. Something like `continue: bool = True` would already be sufficient.
3bd6e20
to
75b0b34
Compare
e8eec21
to
4f4f0ff
Compare
Great, thanks a lot for your work! There are some small issues left to work out, but we're getting there
src/pymovements/utils/archives.py
Outdated
@@ -138,6 +138,7 @@ def _extract_tar(
        source_path: Path,
        destination_path: Path,
        compression: str | None,
        skip: bool = True,
I'd say that `skip` is a bit confusing as a name. I suggested `continue`, but that's a reserved keyword, so let's call it `resume`. This way it's already clear what the argument does without reading the docs.
Done, renamed to `resume`.
src/pymovements/utils/archives.py
Outdated
if (
    os.path.exists(os.path.join(destination_path, member)) and
    member[-4:] not in _ARCHIVE_EXTRACTORS and
    tarfile.TarInfo(os.path.join(destination_path, member)).size > 0 and
This won't really check for the correct size. It only checks whether the archive member's size is greater than zero.
You need to check that the size of the member is the same as the size of the already existing file. Otherwise a partially extracted file (i.e. the existing file is smaller than the archive member) will be skipped.
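A minimal sketch of the size comparison described above. The helper name `is_fully_extracted` and the `demo` function are hypothetical and only for illustration; they are not part of pymovements. The sketch compares `TarInfo.size` from the archive header against the size of the file on disk.

```python
from __future__ import annotations

import tarfile
import tempfile
from pathlib import Path


def is_fully_extracted(member: tarfile.TarInfo, destination_path: Path) -> bool:
    """Return True if the member exists on disk with the size stored in the archive.

    Hypothetical helper for illustration; not part of pymovements.
    """
    destination_filepath = destination_path / member.name
    if not destination_filepath.is_file():
        return False
    # TarInfo.size is the uncompressed size recorded in the archive header.
    return destination_filepath.stat().st_size == member.size


def demo() -> tuple[bool, bool]:
    """Extract a one-file archive, then truncate the file to mimic a partial extraction."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / 'data.txt'
        source.write_text('0123456789')
        archive_path = root / 'data.tar'
        with tarfile.open(archive_path, 'w') as archive:
            archive.add(source, arcname='data.txt')

        destination = root / 'out'
        destination.mkdir()
        with tarfile.open(archive_path) as archive:
            member = archive.getmember('data.txt')
            archive.extract(member, destination)
            complete = is_fully_extracted(member, destination)
            # Cut the file short to simulate an interrupted extraction.
            (destination / 'data.txt').write_text('0123')
            partial = is_fully_extracted(member, destination)
    return complete, partial
```

With a check like this, a truncated file is not mistaken for a finished one and would be extracted again.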
done
src/pymovements/utils/archives.py
Outdated
    os.path.exists(os.path.join(destination_path, member)) and
    member[-4:] not in _ARCHIVE_EXTRACTORS and
    tarfile.TarInfo(os.path.join(destination_path, member)).size > 0 and
    skip
It probably makes sense to check `skip` or `resume` first, because if it's `False` you don't need to check all the other conditions and will extract either way.
I think this is done by python on the fly
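A note on evaluation order: Python's `and` evaluates its operands left to right and stops at the first falsy one, so the flag only saves the other checks when it is written first. The sketch below uses a stand-in check and hypothetical names (`count_checks`, `file_exists_check`) to count how often the expensive condition runs.

```python
def count_checks(flag_first: bool, resume: bool) -> int:
    """Return how many times the stand-in filesystem check is evaluated."""
    calls = 0

    def file_exists_check() -> bool:
        # Stand-in for os.path.exists(...) and the size comparison.
        nonlocal calls
        calls += 1
        return True

    if flag_first:
        # `resume` is tested first; a False value short-circuits the rest.
        _ = resume and file_exists_check()
    else:
        # The check runs before `resume` is even looked at.
        _ = file_exists_check() and resume
    return calls
```

With `resume=False`, the flag-first ordering performs no filesystem check, while the flag-last ordering still performs one per member.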
src/pymovements/utils/archives.py
Outdated
archive.extractall(destination_path, filter='tar')
for member in archive.getnames():
    if (
        os.path.exists(os.path.join(destination_path, member)) and
You should probably break this up into separate conditions for readability.
Create a variable like `destination_filepath = os.path.join(destination_path, member)`.
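One way the loop could read with the conditions split up, as a sketch rather than the code that was merged. The function name `extract_tar_members` and the tuple `NESTED_ARCHIVE_SUFFIXES` (a stand-in for the keys of `_ARCHIVE_EXTRACTORS`, whose contents are assumed here) are hypothetical.

```python
from __future__ import annotations

import os
import tarfile
import tempfile
from pathlib import Path

# Stand-in for the keys of _ARCHIVE_EXTRACTORS; the real contents are an assumption.
NESTED_ARCHIVE_SUFFIXES = ('.zip', '.tar')


def extract_tar_members(
        source_path: Path, destination_path: Path, resume: bool = True,
) -> list[str]:
    """Extract members one by one and return the names that were actually extracted."""
    extracted = []
    with tarfile.open(source_path) as archive:
        for member in archive.getmembers():
            destination_filepath = os.path.join(destination_path, member.name)

            if resume and member.isfile():
                already_exists = os.path.exists(destination_filepath)
                is_nested_archive = member.name.endswith(NESTED_ARCHIVE_SUFFIXES)
                if already_exists and not is_nested_archive:
                    if os.path.getsize(destination_filepath) == member.size:
                        continue  # complete file left over from an earlier run

            archive.extract(member, destination_path)
            extracted.append(member.name)
    return extracted


def demo() -> tuple[list[str], list[str], list[str]]:
    """Run a full extraction, a resumed one after truncating a file, and a forced one."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / 'a.txt').write_text('aaa')
        (root / 'b.txt').write_text('bbbbb')
        archive_path = root / 'data.tar'
        with tarfile.open(archive_path, 'w') as archive:
            archive.add(root / 'a.txt', arcname='a.txt')
            archive.add(root / 'b.txt', arcname='b.txt')

        destination = root / 'out'
        destination.mkdir()
        first = extract_tar_members(archive_path, destination)
        # Truncate one file to mimic an interrupted run.
        (destination / 'b.txt').write_text('b')
        second = extract_tar_members(archive_path, destination)
        third = extract_tar_members(archive_path, destination, resume=False)
    return first, second, third
```

Naming each condition keeps the skip decision readable, and testing `resume` first avoids the filesystem calls when the caller wants everything extracted again.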
done
src/pymovements/utils/archives.py
Outdated
archive.extractall(destination_path)
for member in archive.namelist():
    if (
        os.path.exists(os.path.join(destination_path, member)) and
You will need the same logic here as in `_extract_tar()`.
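A sketch of what the equivalent logic might look like for zip archives. The function name `extract_zip_members` is hypothetical; the main difference from the tar case is that `zipfile` exposes the uncompressed size as `ZipInfo.file_size`.

```python
from __future__ import annotations

import os
import tempfile
import zipfile
from pathlib import Path


def extract_zip_members(
        source_path: Path, destination_path: Path, resume: bool = True,
) -> list[str]:
    """Extract members one by one and return the names that were actually extracted."""
    extracted = []
    with zipfile.ZipFile(source_path) as archive:
        for info in archive.infolist():
            destination_filepath = os.path.join(destination_path, info.filename)

            if resume and not info.is_dir() and os.path.isfile(destination_filepath):
                # ZipInfo.file_size is the uncompressed size of the member.
                if os.path.getsize(destination_filepath) == info.file_size:
                    continue  # complete file left over from an earlier run

            archive.extract(info, destination_path)
            extracted.append(info.filename)
    return extracted


def demo() -> tuple[list[str], list[str], list[str]]:
    """Run a full extraction, a resumed one after truncating a file, and a forced one."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        archive_path = root / 'data.zip'
        with zipfile.ZipFile(archive_path, 'w') as archive:
            archive.writestr('a.txt', 'aaa')
            archive.writestr('b.txt', 'bbbbb')

        destination = root / 'out'
        destination.mkdir()
        first = extract_zip_members(archive_path, destination)
        # Truncate one file to mimic an interrupted run.
        (destination / 'b.txt').write_text('b')
        second = extract_zip_members(archive_path, destination)
        third = extract_zip_members(archive_path, destination, resume=False)
    return first, second, third
```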
done
        ),
    ],
)
def test_extract_archive_destination_path_not_None_no_remove_top_level_no_remove_finished_twice(
I'm not sure if I understand what you are testing here. The test would already pass without this PR, right?
As I don't think it's worth much effort to test this thoroughly, I'd probably just check that the calls to `TarFile.extract()` use the expected arguments via `unittest.mock` (for the second `extract_archive()` call).
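A sketch of how such a check could look with `unittest.mock`. To stay self-contained it uses a hypothetical stand-in, `extract_with_resume`, instead of the real `extract_archive()`, so the names and behaviour here are assumptions. `TarFile.extract` is patched only for the second run, and the recorded calls are inspected.

```python
from __future__ import annotations

import tarfile
import tempfile
from pathlib import Path
from unittest import mock


def extract_with_resume(source_path: Path, destination_path: Path) -> None:
    """Hypothetical stand-in for extract_archive() that skips complete files."""
    with tarfile.open(source_path) as archive:
        for member in archive.getmembers():
            target = destination_path / member.name
            if target.is_file() and target.stat().st_size == member.size:
                continue
            archive.extract(member, destination_path)


def run_check() -> tuple[int, list[str]]:
    """Return the extract() call count on a clean rerun and the names re-extracted after truncation."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / 'a.txt').write_text('aaa')
        (root / 'b.txt').write_text('bbbbb')
        archive_path = root / 'data.tar'
        with tarfile.open(archive_path, 'w') as archive:
            archive.add(root / 'a.txt', arcname='a.txt')
            archive.add(root / 'b.txt', arcname='b.txt')

        destination = root / 'out'
        destination.mkdir()
        extract_with_resume(archive_path, destination)  # first run, real extraction

        # Second run with everything in place: extract() should not be called.
        with mock.patch.object(tarfile.TarFile, 'extract') as mock_extract:
            extract_with_resume(archive_path, destination)
        calls_when_complete = mock_extract.call_count

        # Truncate one file; only that member should be passed to extract().
        (destination / 'b.txt').write_text('b')
        with mock.patch.object(tarfile.TarFile, 'extract') as mock_extract:
            extract_with_resume(archive_path, destination)
        names_when_partial = [call.args[0].name for call in mock_extract.call_args_list]
    return calls_when_complete, names_when_partial
```

Unlike running the extraction twice and comparing the output files, asserting on the recorded calls would fail on a branch that re-extracts everything.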
see adjustments via print
c04d614
to
1477ee3
Compare
for more information, see https://pre-commit.ci
resolves #488 eventually
currently, tox does not work locally for whatever reason.
TODO: